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Abstract 

Kernel Estimation provides an unbinned and non-parametric estimate of the prob- 
ability density function from which a set of data is drawn. In the first section, after 
a brief discussion on parametric and non-parametric methods, the theory of Kernel 
Estimation is developed for univariate and multivariate settings. The second sec- 
tion discusses some of the applications of Kernel Estimation to high-energy physics. 
The third section provides an overview of the available univariate and multivariate 
packages. This paper concludes with a discussion of the inherent advantages of ker- 
nel estimation techniques and systematic errors associated with the estimation of 
parent distributions. 
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KEYS, RootPDE, WinPDE, PDE, HEPUKeys,Unbinned, Non-Parametric 



1 Introduction 



1.1 Overview 



Perhaps the most common practical duty of a particle physicist is to analyze 
various distributions from a set of data {U}. The typical tool used in this anal- 
ysis is the histogram. The role of the histogram is to serve as an approximation 
of the parent distribution, or probability density function (pdf) from which 
the data were drawn. While histograms are straightforward and computation- 
ally efficient, there are many more sophisticated techniques which have been 
developed in the last century. One such method, kernel estimation, grew out 
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of a simple generalization of the histogram and has proved to be particularly 
well-suited for particle physics. 

In order to produce continuous estimates f(x) of the parent distribution from 
the empirical probability density function epdf (x) = J2i $( x ~ ^i)> several tech- 
niques have been developed. These techniques can be roughly classified as ei- 
ther parametric or non-parametric. Essentially, a parametric method assumes 
a model f(x; a) dependent on the parameters a = (a±, a<i, «3, . . .). The speci- 
fication of this model is "entirely a matter for the practical [physicist]^' . The 
goal of a parametric estimate is to optimize the parameters aj with respect 
to some goodness-of-fit criterion (e.g. x 2 , log-Likelihood, etc.). Parametric 
models are powerful because they allow us to infuse our model with our knowl- 
edge of physics. While parametric methods are very powerful, they are highly 
dependent on the specification of the model. Parametric methods are clearly 
not practical for estimating the distributions from a wide variety of physical 
phenomena. 

The goal of non-parametric methods is to remove the model-dependence of the 
estimator. Non-parametric estimates are concerned directly with optimizing 
the estimate f(x). The prototypical non-parametric density estimate is the 
histogram^]. Somewhat counterintuitively, non-parametric methods typically 
involve a large - possibly infinite - number of "parameters" (better thought of 
as degrees of freedom). Scott and Terrell supplied a more concrete definition 
of a non-parametric estimator, "Roughly speaking, non-parametric estimators 
are asymptotically local, while parametric estimators are not."([l]) That is 
to say, the influence of a data point £j on the density at x should vanish 
asymptotically (in the limit of an infinite ammount of data) for any \x — ti\ > 
in a non-parametric estimate. The purpose of this paper is to introduce the 
notion of a kernel estimator and the inherent advantages it offers over other 
parametric and non-parametric estimators. 

1.2 Kernel Estimation 

The notion of a kernel estimator grew out of the asymptotic limit of Averaged 
Shifted Histograms (ASH). The ASH is a simple device that reduces the bin- 
ning effects of traditional histograms. The ASH algorithm is as follows: First, 
create a family of N histograms, {Hi}, with bin- width h, such that the first 
bin of the i th histogram is placed at Xq + ih/N. Because Xq is an artificial 
parameter, each of the Hi is an equally good approximation of the parent 
distribution. Thus, an obvious estimate of the parent distribution is simply 
the average of the Hi, hence the name 'Average Shifted Histogram'. Note that 

2 Prom a debate between R.A. Fisher and Karl Pearson 

3 The name 'histogram' was coined by Karl Pearson 
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resulting estimate (with N times more bins than the original) is not a true 
histogram, because the height of a 'bin' is not necessarily equal to the number 
of events falling in that bin. However, it is a superior estimate of the parent 
distribution, because the dependence of initial bin position is essentially re- 
moved. In the limit N — > oo the ASH is equivalent to placing a triangular 
shaped kernel of probability about each data point t% ([!]). 



1.2.1 Fixed Kernel Estimation 



In the univariate case, the general kernel estimate of the parent distribution 
is given by 



where {tj} represents the data and h is the smoothing parameter (also called 
the bandwidth) . Immediately we can see that our estimate fo is bin- independent 
regardless of our choice of K. The role of K is to spread out the contribution 
of each data point in our estimate of the parent distribution. An obvious and 
natural choice of K is a Gaussian with /i = and a = 1: 



K(x) = -=e~ x l2 . (2) 



Though there are many choices of K, Gaussian kernels enjoy the attributes 
of being positive definite, infinitely different iable, and defined on an infinite 
support. For physicists this means that our estimate fo is smooth and well- 
behaved in the tails. 

Now we concern ourselves with the choice of the bandwidth h. In Equation 1, 
the bandwidth is constant for all i. Thus, fo is referred to as the fixed kernel 
estimate. The role of h is to set the scale for our kernels. Because the kernel 
method is a non-parametric method, h is completely specified by our data set 
{ti}. In the limit of a large amount (n — > oo) of normally distributed data ([!]), 
the mean integrated squared error of fo is minimized when 

/4\ 1/5 

h* = (-J an- 1 / 5 . (3) 



Of course, we rarely deal with normally distributed data, and, unfortunately, 
the optimal bandwidth h* is not known in general. In the case of highly bi- 
modal data (e.g. the output of a neural network discriminate), the standard 
deviation of the data is not a good measure for the scale of the true structure 
of the distribution. 
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1.2.2 Adaptive Kernel Estimation 

An astute reader may object to the choice of h* given in Equation 3 on the 
grounds of self-consistency - non-parametric estimates should only depend on 
the data locally, and a is a global quantity. In order for the estimate to handle 
a wide variety of distributions as well as depend on the data only locally, we 
must introduce adaptive kernel estimation. The only difference in the adap- 
tive kernel technique is that our bandwidth parameter is no longer a global 
quantity We require a term that acts as a\ oca \ in Equation 3. Abramson (||) 
proposed an adaptive bandwidth parameter given by the expression 

hi = h/JJ(t~). (4) 



Equation 4 reflects the fact that in regions of high density we can accurately 
estimate the parent distribution with narrow kernels, while in regions of low 
density we require wide kernels to smooth out statistical fluctuations in our 
empirical probability density function. Technically we are left with two out- 
standing issues: i) the expression for hi given in Equation 4 references the 
a priori density, which we do not know, and ii) the optimal choice of h has 
still not been specified. Clearly, h* oc y/a, because of dimensional analysis. 
Additionally, f is our best estimate of the true parent distribution. Thus we 
obtain 

h(x) = \±^K{^±), (5) 



with 




(6) 



The adaptive kernel estimate can be thought of as a "second iteration" of the 
general kernel estimation technique. In practice, the adaptive kernel technique 
almost completely removes any dependence on the original choice of the band- 
width in the fixed kernel estimate fo- Furthermore, the adaptive kernel deals 
very well with multi-modal distributions. In extreme situations (i.e. when the 
scale of the local structure of the data tx local is more than about two orders of 
magnitude smaller than the standard deviation cr of the data) the factor p in 
Equation 6 should be adjusted from its typical value of unity. In that case 



,> J^f- (7) 
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We have concluded the construction of a non-parametric estimate f\ of an uni- 
variate parent distribution based on the empirical probability density function. 
Our estimate is bin- independent, scale invariant, continuously differentiable, 
positive definite, and everywhere defined. 

1.2.3 Boundary Kernels 

Both the fixed and adaptive kernel estimates assume that the domain of 
the parent distribution is all of R. However, the output of a neural net- 
work discriminant, for example, is usually bounded by < x < 1, where 
f{x < 0) = f{x > 1) = 0. In order to avoid probability from "spilling out" of 
the boundaries we must introduce the notion of a boundary kernel. Without 
boundary kernels, our estimate will not be properly normalized and underes- 
timate the true parent distribution close to the boundaries. 

Boundary kernels modify our traditional Gaussian kernels so that the total 
probability in the allowed regions is unity. Clearly, our kernel should smoothly 
vary back to our original Gaussian kernels as we move far from the bound- 
aries. This constraint quickly reduces the kinds of boundary kernels we need 
consider. Though a large amount of work has been put forward to introduce 
kernels which preserve the criteria 

oo 

J tK(t)dt = 0, (8) 

— oo 

these methods do not suit themselves well for physics applications. The pri- 
mary problem is that the parametrized family of boundary kernels may con- 
tain kernels that are not positive definite - which negates their applicability to 
physics. Also, boundary kernels satisfying Equation 8 systematically underes- 
timate the parent distribution at a moderate distance from the boundary and 
overestimate very near the boundary. 

An alternate solution to the boundary problem is to simply reflect the data 
set about the boundary ([!]). In that case, the probability that spills out of the 
boundary is exactly compensated by its mirror. 

1.3 Multivariate Kernel Estimation 

The general kernel estimation technique generalizes to d-dimensions ([!]). One 
choice for the d- dimensional kernel is simply a product of univariate kernels 
with independent smoothing parameters. The following discussion will be re- 
stricted to the context of such product kernels. 
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Fig. 1. The performance of boundary kernels on a Neural Network distribution with 
a hard boundary 

1.3.1 Covariance Issues 



When dealing with multivariate density estimation, the covariance structure of 
the data becomes an issue. Because the covariance structure of the data may 
not match the diagonal covariance structure of our kernels, we must apply 
a linear transformation which will diagonalize the covariance matrix of 
the data. Ideally, the transformation would remain a local object; however, in 
practice such non-linear transformations may be very difficult to obtain. In 
the remainder of this paper, the transformation matrix will be referred to as 
Ajk, and the {ti} will be assumed to be transformed. 



1.3.2 Fixed Kernel Estimation 



For product kernels, the fixed kernel estimate is given by 



fo(x) 



L - V 

nhi . . . h d fr! 



(9) 



In the asymptotic limit of normally distributed data, the mean integrated 
squared error of /o is minimized when 

h > = (ih) ^~ ini+1) - (l0) 



1.3.3 Adaptive Kernel Estimation 

The adaptive kernel estimate fi(x) is constructed in a similar manner as the 
univariate case; however, the scaling law is usually left in a general form. 
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Because most multivariate data actually lies on a lower dimensional manifold 
embedded in the input space, the effective dimensionality d' must be found by 
maximizing some measure of performance or making some assumption. Thus 
the multivariate adaptive bandwidth is usually written 

h l = hf- 1 ' d \i i ). (11) 

Though d! pa d, the precise value depends on the problem. Note that the form 
of hi given in Equation 11 is independent of j, thus it produces spherically 
symmetric kernels. This is clearly not optimal. Furthermore, when d! 7^ d the 
optimal value of h may vary wildly. This is because the units are no longer 
correct and (d/d r ) powers of scale factors are introduced by f~ 1 ! d ' ' . Both of 
these problems may be remedied with the introduction of a natural length 
scale associated with the data: the geometric mean of the standard deviations 
of the transformed {£«}, o = det(AT l A T ) 1 ^ 2d . In the absence of local covariance 
information, the best we can do is assume that the hj are proportional to aj 
and inversely proportional to f 1 ^' . Thus we arrive at 

K 3 = (^) (^) rWVf-Wfa (12) 

which is produces estimates that are invariant under linear-transformation of 
the input space when the covariance matrix is diagonalized. 

1.3.4 Multivariate Boundary Kernels 

Just as in the univariate case, it is possible that the physically realizable 
domain of our parent distribution is not all of M. d , but instead a bounded 
subspace of M, d . Typically, this situation arises when one of the components of 
the sample vector is bounded in the univariate sense (i.e. tj < x™ ax ). However, 
once we diagonalize the covariance matrix of our data the boundary condition 
will take on a new form in the transformed coordinates. In general, any linear 
boundary in our original coordinates Xj can be expressed as CjXj = C, where Cj 
is the unit-normal to the (d — l)-dimensional hyperplane in our d-dimensional 
domain and C is the distance between the origin and the point-of-closest 
approach. After transforming to a set of coordinates x'j = Aj k x k , in which the 

— * 

{ti} have diagonal covariance, our boundary condition is given by 

djXj = CkA^j Aj k Xk = C. Thus, for each boundary one must introduce a 

reflected sample {tj efl } with 

tf = t ij + 2(C-d k t ik )d j , (13) 
in order to rectify the probability that spilled into unphysical regions. 
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1.3.5 Event-by-Event Weighting 



In high-energy physics it is often necessary to combine data from hetero- 
geneous sources (e.g. independently produced Monte Carlo data sets which 
together comprise the Standard Model expectation). In general one would like 
to estimate the parent distribution from a more general empirical probability 
density function epdf (x) = J2i WiS(x — U), where Wi represents the weight or a 
posteriori probability of the i th event. In the case of combining various Monte 
Carlo samples, one must reweight all events of a sample to some common 
luminosity (say, 1 pb _1 ) before combining them. Thus for a Monte Carlo sam- 
ple with umc events and cross-section o~ MC each event must be weighted with 
Wi = l (pb^ 1 )//^ = or MC /n MC 

The covariance matrix of the weighted sample must be generalized as follows: 

v. = (Mi ~ ^j)(Uk - /ifc) ^ = (Uj - fij)(Uk - fik) 

i=i 11 i=i " 



where n = Y^i=i w i an d fi = Y^i=\ w iti/ n - Then our estimate is simply given 
by 



1 n 



n 7=i 



j=i n,ij 



Xj tij 
hij 



(15) 



2 Applications 



Kernel estimation techniques are applicable to all situations in which it is 
necessary or useful to have an estimate of the parent distribution of a set of 
data. In high-energy physics the use of kernel estimation runs the gamut from 
event selection to confidence level calculation. 



2.1 Confidence Level Calculations 



The most wide-spread use of kernel estimation techniques has been in the con- 
text of confidence level calculations, primarily for the Higgs Searches at LEP 
(H? ; |j? ? ). Each selected candidate event has associated with it one or several 
discriminant variables (e.g. reconstructed Higgs mass, neural network output, 
etc.). Estimates of the distribution of discriminant variables are constructed 
for the signal and background processes, £ sig and £back respectively. These 
estimates are used to further discriminate between candidate events via the 
log-Likelihood ratio log(£ S i g /£ S i g ) or some other 'signal estimator'. With the 
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log-Likelihood in hand, confidence level calculations transcend pure number 
counting and allow for a more powerful statistical tool with which to interpret 
the data (|5|). 

2. 2 Measurements 

2.2.1 Maximum Likelihood Fits 

Another context in which kernel estimation has been applied is the measure- 
ment of physical constants via maximum likelihood fitting. Traditionally, the 
log-Likelihood log£ = J2i l°g/(^j «) is maximized with respect to the param- 
eters a = (ai, a 2 , a 3 , . . .). In this context, /(tj; a) is a parametrized model 
of the physical situation. In practice not all of the aj are 'floated' or varied 
in the maximization routine, but instead many parameters are 'fixed' from 
some independent measurement. While this model incorporates empirical or 
theoretical information, it may make unwanted assumptions about our data. 

For an example, let us consider the measurement of sin 2(3 at a B factory. 
The probability density of a CP decay recoiling from a tagged B (B) meson 
is given by 

f{t-(3) = e~ r| *l(l ± sin 2(3 sin Ami), (16) 

where t is the time difference between the decay of the CP state and the re- 
coiling B (B) tagged meson with Az = / -f/3ct. However, in an experiment we 
must take into account the mistag rate w and the resolution of Az. The stan- 
dard prescription is to measure w and parametrize the resolution distribution 
R(Az truc — Az TCCO ) with a single (or double) Gaussian with bias 5 and vari- 
ance a. The final probability distribution is obtained via a convolution with 
the resolution function and is of the form f(t; w, 5, a, (3) = R(5, a) <g) f(t; w, (3). 
Now with w, 6, and a 'fixed' we must 'float' (3 to make our measurement (^). 
Here the form of R, while justified, will have a systematic influence on the 
measured value of sin 2(3. If, on the other hand, the resolution function R 
was estimated via a non-parametric means {i.e. kernel estimation techniques), 
then there would be no artificial influence on the measurement and non-trivial 
resolution effects would be taken into account automatically. 

2.2.2 N on- Parametric Parameter Estimation 

Knuteson proposed a non-parametric parameter estimation technique based on 
kernel estimation of the joint density of feature vectors and the parameters a 
to be measured. Via Bayesian statistics the posterior density of the parameters 
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given an observation is obtained. Then one estimates the parameters by simply 
maximizing the posterior density (0) . 



2.3 Discriminant Analysis 



The first uses of kernel estimation in high-energy physics were in the context of 
new particle searches. In order to create a discriminant analysis, one constructs 
the discriminant D{x) according to: 

£>(£) = . (17) 

fsig(x) + /back(^) 



It is easily seen that signal-like events get mapped to 1 and background-like 
events are mapped to 0. By placing a cut on D{x) a (possibly disconnected) 
region (bounded by the corresponding {d — l)-dimensional contour) in the in- 
put space is selected. Because of the geometrical complexity of the selected 
region, analyses of this type are more efficient than linear discriminant analy- 
ses. Furthermore, because of the intuitive nature of Equation 17, many prefer 
analyses of this type over Artificial Neural Networks. While these analyses may 
avoid the black-box properties of Artificial Neural Networks, their usefulness 
is limited to d < 6 due to the curse of dimensionality. However, with more 
intelligently chosen variables, the practical restriction of d < 6 is not much of 
a limitation. Discriminant analyses such as these were met with much success 
in the search for the top quark at D0 (Ft K [TO). 



2-4 Event Selection and Cut Optimization 



An obvious application of kernel estimation techniques is the improvement 
and automation of cut optimization. Given signal and background samples, 
one typically varies cuts to find a region 1Z which maximizes some measure 
of performance {e.g. S/B, S/yB, etc.). Traditionally, the cuts are applied 
directly to the sample events, thus only discrete values of the performance 
measure are obtained. This leaves the experimentalist in a bit of a quandary 
because in the neighborhood of the optimal cut value the performance measure 
tends to have many local extrema and is constant between sample points. Not 
surprisingly, the optimal cut value is often just chosen by eye. 

The above method of cut optimization may be generalized by estimating the 
performance not from the raw samples, but instead from the estimates of the 
signal and background parent distributions, / sig and /back respectively. Then 
instead of having discrete values of S and B we obtain continuous functions 
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S(TZ) and B(JZ). Finally, we are left with the more well-defined problem of 
finding the global maxima of a continuously varying performance measure - 
in the case of S/ VB we obtain 



1Z = argmax <{ = . 

/n hack J-^ f hSLCk (x)dx ) 



In fact, it should be pointed out that with a finite sample size there is an inher- 
ent uncertainty in / sig and /back- Thus, there is an an inherent uncertainty in 1Z. 
In that case, the optimal selection strategy becomes probabilistic - events on 
the boundary TZ should be selected with some well defined probability While 
this strategy can be carried out in a deterministic fashion {e.g. generating 
pseudo-random numbers with event and run number as a random seed) the 
gain in performance does not merit the added complexity of the selection. 



3 Available Packages 

3.1 Univariate Packages 
3.1.1 KEYS 

In order to implement univariate kernel estimation techniques to produce 
FORTRAN functions of f\ based on the data in HBOOK ntuples, KEYS was 
developed. KEY3^\ has been implemented in a Perl script which: 1) parses an 
input file specifying the input ntuples and output filenames of a set of shapes,^ 
2) interfaces with PAW to obtain the data set {U}, 3) produces a FORTRAN 
function to calculate fi(x), 4) compiles and evaluates the FORTRAN function, 
5) produces plots (see Figure 2) of the estimate and the Kolmogorov-Smirnov 
probability that the {ti} is compatible with fi (]TT|), and 6) produces a sec- 
ond FORTRAN function which linearly interpolates f\ between 300 sample 
points in order to expedite the evaluation of the function. KEYS has been 
used extensively for the parametrization of discriminant variable distributions 
in Higgs searches at LEP (H? ; §? ? ? ). 



4 KEYS is an acronym for Kernel Estimating Your Shapes 

5 The word shape is lingo for the continuous estimate of the parent distribution 
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Fig. 2. The standard output of the KEYS script. The top left plot shows the cu- 
mulative distributions of the KEYS shape and the data. The top right plot shows 
the difference between the two cumulative distributions, the maximum of which is 
used in the calculation of the Kolmogorov-Smirnov test. The bottom plot shows the 
shape produced by KEYS overlayed on a histogram of the original data. 

3.1.2 HEP U Keys 

Another implementation of univariate kernel estimation has been developed 
by Jeremiah Mans of the L3 collaboration.^ This implementation is written 
in C ++ . The HEPUKeys class executes a C-compiler, and uses the dynamic 
loading methods (dlopen, dlclose, dlsym) for interactive or embedded imple- 
mentations of the kernel estimation technique. This embedded structure allows 
one to include confidence level calculation, cut optimization, or any other ap- 
plication of kernel estimation directly in the analysis code. 



6 electronic mail: Jeremy.Mans@cern.ch 
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3.2 Multivariate Packages 



3.2.1 PDE 

The original implementation of multivariate kernel estimation in high-energy 
physics was developed at Rice University by Holmstrom, MiettinenQ and 
Sain with guidance from Scott around 1994. This VMS-based FORTRAN 
implementation, called PDE, was applied to searches for the top quark at D0 

SB ED- 

3.2.2 RootPDE 

In order to produce a more portable and flexible package for kernel estimation, 
the author developecQan object oriented implementation written in C ++ . The 
core numerical procedures were left in a very general form with polymorphic 
kernels and minimal user interface. The core procedures have been adopted 
by two complete packages: RootPDE and WinPDE. 

Padley and Askew^] interfaced the core numerical procedures described above 
into a shared Root library called RootPDE. This package was primarily in- 
tended for discriminant analyses as described in Section 2.3. However, this 
package was submitted to D0's CVS repository prior to the development of 
Equation 12 (see Section 1.3.3) and does not support event-by-event weighting 
(see Section 1.3.5) or boundary kernels (see Sections 1.2.3 and 1.3.4). 

Since the original release of RootPDE, the author has been including more 
advanced features and developed methods which allow a kernel estimate ob- 
ject to write out a C ++ function for use in other applications. Furthermore, 
the interface has been generalized from pure discriminant analysis to a form 
more amenable to any of the applications described in Section 2. Currently, 
this development is taking place at BaBar; however, it will be made publicly 
available in the near future. 



3.2.3 WinPDE 

WinPDE is very similar to the original release of RootPDE described in Sec- 
tion 3.2.2; however, the interface is written in VisualBasic and intended for 
Microsoft Windows platforms. Furthermore, WinPDE can interface with Mi- 
crosoft Excel spreadsheets and offers a very intuitive user-interface. 



electronic mail: miettinen@physics.rice.edu 

This work was partially funded by an Envision grant from Rice University 
electronic mail: padley@physics.rice.edu, askew@physics.rice.edu 
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4 Conclusions 



4-1 Comparison with SMOOTH 

It seems appropriate to put kernel estimation techniques in a proper setting 
before concluding with a discussion of their inherent benefits. While kernel 
estimation techniques may be applied to situations in which a parametric es- 
timates are popular, that comparison has essentially been made by Section 1.1. 
Instead, let us consider perhaps the most widely used non-parametric density 
estimation technique in high-energy physics: PAW's SMOOTH utility. 

4.1.1 SMOOTH 

A full development of the HQUADF function that is used by PAW's SMOOTH 
utility is beyond the scope of this paper. However, a brief outline of the algo- 
rithm is presented. First and foremost, it is important to realize that SMOOTH 
operates on histograms and not on the original data set {ti}. Thus, SMOOTH 
is dependent on the original binning of the data. SMOOTH was introduced 
by this journal in John Allison's 1993 paper (0). We will restrict ourselves 
to the univariate case. Essentially SMOOTH works by finding the bins / of 
significant variation in the histogram {hi} and then using those points to con- 
struct a smoothed linear interpolation. Bins of significant variation are those 
which satisfy Si > S*, where S* is a user-defined significance threshold and 



With the points of significant variation {xi} in hand, the smoothed shape is 
given by 



where <j>iir) = yr 2 + Aj are the radial basis functions. The A; are user-defined 
smoothness parameters (radii of curvature). The ai are found by minimizing 
the x 2 between s(x) and the original histogram. As Allison pointed out "lower 
X 2 can be obtained by reducing the cut on Si at the expense of following more 
of what might only be statistical fluctuations." By a different choice of S* and 
A, the user has the power to magnify or remove statistical fluctuation in the 
data. 



h l+l - 2hi + hi-i 



(19) 




s ( x ) = J2ai4>i(\ x ~ x i\) 



(20) 
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4-1-2 Comparison 



Despite the user-specified parameters S* and A, SMOOTH is a non-parametric 
estimate of a probability density function based on a set of data. The primary 
differences between SMOOTH and kernel estimates are their approach and 
their rigor. While kernel estimates are bin-independent constructions of the 
estimate, SMOOTH is a parameter-dependent fit of the estimate to a user- 
provided histogram. Practically speaking, kernel estimates are based on well 
defined statistical techniques and SMOOTH'S estimates are adjusted by eye 
allowing for user bias and large systematic uncertainty. 



4-2 Systematic Errors 



When kernel estimation techniques are applied to confidence level calculations 
or parameter estimation, systematic effects become of particular importance. 
One may loosely classify the systematic errors associated with probability den- 
sity estimation as either inherent or user-related errors. In its pure form kernel 
estimation techniques are entirely deterministic and have no user-specified pa- 
rameters. If one decides to free the value of p from its nominal value of unity 
(see Equation 7) or allow d! ^ d, then user-related systematic error are in- 
troduced. For SMOOTH, the user-related parameters S* and A can not be 
avoided. In addition to the possible user-related systematic errors, there are 
inherent systematic errors introduced by any probability density estimation 
technique. For parametric estimates, this inherent systematic is related to the 
quality of the model; while for non-parametric estimates, this inherent system- 
atic is related to the flexibility of the technique. The development of kernel 
estimation techniques has been directly focused on flexibility and the mini- 
mization of a particular choice of inherent systematic error: the asymptotic 
mean integrated squared error (|l|). 

In practice, an experimentalist will want to choose their own estimate of the 
inherent systematic error (e.g. the effect on the measured value of a parameter 
or 95% confidence level limit). This can be done in a variety of ways that effec- 
tively reduce to producing a family of estimates from independent samples of 
the same parent distribution. This family may be obtained by simply splitting 
up the data or via toy Monte Carlo simulation. Because the systematic error 
introduced by the estimation technique is a function(al) of the sampled parent 
distribution (which is unknown), the estimate itself is the best available choice 
of the parent distribution to be sampled in a Monte Carlo study. 
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4-3 Remarks 



Obviously, kernel estimation techniques are very powerful and very relevant 
to high-energy physics. While these techniques have been applied to a wide 
range of analyses, they seem to by largely unknown by the community. It is the 
aim of this paper to describe kernel estimation techniques, their applications, 
and their current implementations in order to promote their use. It is to the 
author's personal gratification to be part of a cross-polination of ideas between 
fields - in this case statistics and high-energy physics. 
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